Two-Phase Evaluation Architecture #55

Open
oshorefueled wants to merge 16 commits into main from split_suggest

Conversation

@oshorefueled (Contributor) commented Jan 13, 2026

Two-Phase Evaluation Architecture

Summary

Splits LLM evaluation into two distinct phases—detection and suggestion—to improve output quality and reduce hallucinations.

Problem

The previous single-pass approach asked the LLM to simultaneously detect issues AND generate suggestions in one structured call. This led to:

  • Rushed analysis with shallow reasoning
  • Suggestions that didn't align with detected issues
  • Difficulty debugging which phase caused failures

Solution

Phase 1: Detection — Uses an unstructured LLM call with free-form markdown output. The model focuses solely on finding issues with detailed reasoning.

Phase 2: Suggestion — Uses a structured JSON schema call. The model receives the full document context and detected issues, then generates targeted suggestions for each.
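The flow above can be sketched as follows. This is a minimal illustration only: the DetectionPhaseRunner, SuggestionPhaseRunner, and ResultAssembler classes in this PR do the real work, and the function signatures here are assumptions, not the actual API.

```typescript
// Illustrative orchestration of the two-phase flow. The real
// BaseEvaluator in this PR wires the phase runners together.
interface RawIssue {
  quotedText: string;
  line: number;
  criterionName: string;
  analysis: string;
}

interface Suggestion {
  issueIndex: number; // 1-based, matching "Issue N" in the detection output
  suggestion: string;
  explanation: string;
}

async function evaluateTwoPhase(
  content: string,
  criteria: string,
  detect: (content: string, criteria: string) => Promise<RawIssue[]>,
  suggest: (content: string, issues: RawIssue[]) => Promise<Suggestion[]>
): Promise<{ issues: RawIssue[]; suggestions: Suggestion[] }> {
  // Phase 1: unstructured call focused purely on finding issues.
  const issues = await detect(content, criteria);
  if (issues.length === 0) {
    return { issues: [], suggestions: [] }; // "no issues" short-circuit
  }
  // Phase 2: structured call sees the full document plus the detected issues.
  const suggestions = await suggest(content, issues);
  return { issues, suggestions };
}
```

Note the short-circuit: when detection finds nothing, the suggestion phase is skipped entirely, which is also the behavior the PR's tests exercise.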

Changes

  • LLM Providers: Added runPromptUnstructured() to all providers
  • Detection Phase: New DetectionPhaseRunner with markdown parser
  • Suggestion Phase: New SuggestionPhaseRunner with Zod validation
  • Result Assembly: New ResultAssembler merges phases and aggregates tokens
  • Retry Logic: New withRetry() utility for transient failures
  • Base Evaluator: Updated to orchestrate two-phase flow

Stats

27 files changed, 6576 insertions(+), 158 deletions(-)
398 tests passing

Testing

  • Property-based tests for detection parsing accuracy
  • Property-based tests for suggestion-to-issue matching
  • Zod runtime validation on LLM responses
  • Token usage aggregation verification

Breaking Changes

None. Output format remains backward compatible.

Summary by CodeRabbit

Release Notes

  • New Features
    • Implemented two-phase evaluation system that detects issues in content, then generates targeted suggestions for fixes
    • Added automatic retry capability for handling temporary service interruptions
    • Suggestions now generated with full document context for improved coherence and quality


oshorefueled and others added 16 commits January 13, 2026 13:35
Implements withRetry function for handling transient LLM API failures
with configurable retry attempts and detailed logging.

Changes:
- Add src/evaluators/retry.ts with withRetry function
- Accepts operation, maxRetries (default 3), and context string
- Logs each retry attempt with [vectorlint] prefix for debugging
- Returns RetryResult<T> with data and attempt count
- Export from src/evaluators/index.ts for use by phase runners
- Add comprehensive unit tests (9 tests, all passing)
- Include Property 5 test for retry behavior verification

Purpose: Provides foundational retry logic for detection and suggestion
phases in the two-phase evaluation architecture.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
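A minimal sketch of what such a utility can look like. The real retry.ts also exports a RetryOptions type, and a later commit in this PR changed withRetry to return T directly rather than a RetryResult wrapper; the exact signature below is an assumption based on those descriptions, not the actual code.

```typescript
// Sketch of a retry utility for transient failures, assuming the
// final (unwrapped) signature: operation, maxRetries (default 3),
// and a context string used for log messages.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  context = "operation"
): Promise<T> {
  // Guard against maxRetries: 0, which would otherwise throw undefined.
  if (maxRetries < 1) {
    throw new Error(`withRetry(${context}): maxRetries must be >= 1`);
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (e: unknown) {
      lastError = e;
      console.log(
        `[vectorlint] ${context} failed (attempt ${attempt}/${maxRetries})`
      );
    }
  }
  throw lastError;
}
```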
- Add runPromptUnstructured method to LLMProvider interface
- Implement runPromptUnstructured in OpenAI, Anthropic, Azure OpenAI, and Gemini providers
- Returns raw text response instead of structured JSON
- Includes debug logging, error handling, and token usage tracking
- Add unit tests for unstructured provider methods

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
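The shape of the interface extension might look like this. The LLMResult field names (data, inputTokens, outputTokens) are inferred from the token-usage tracking and test mocks described elsewhere in this PR; the real method parameters may differ.

```typescript
// Sketch of the LLMProvider extension. Field and parameter names
// are assumptions inferred from this PR's description, not the real API.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface LLMResult<T> {
  data: T;
  usage: TokenUsage;
}

interface LLMProvider {
  // Existing structured call: response must conform to a JSON schema.
  runPromptStructured<T>(
    content: string,
    prompt: string,
    schema: object
  ): Promise<LLMResult<T>>;
  // New in this PR: raw text response, no schema enforcement.
  runPromptUnstructured(
    content: string,
    prompt: string
  ): Promise<LLMResult<string>>;
}

// A minimal fake provider, mirroring the dependency-injection style
// the PR's tests use (no network calls in unit tests).
class FakeProvider implements LLMProvider {
  async runPromptStructured<T>(
    _content: string,
    _prompt: string,
    _schema: object
  ): Promise<LLMResult<T>> {
    return { data: {} as T, usage: { inputTokens: 0, outputTokens: 0 } };
  }
  async runPromptUnstructured(
    _content: string,
    _prompt: string
  ): Promise<LLMResult<string>> {
    return { data: "No issues found.", usage: { inputTokens: 1, outputTokens: 1 } };
  }
}
```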
Add detection-phase key to src/evaluators/prompts.json with:
- Guided format instructions for output with markdown example
- Issue N output format (Issue 1, Issue 2, etc.) as section headers
- Required fields: quotedText, contextBefore, contextAfter, line, criterionName, analysis
- {criteria} placeholder for dynamic criteria insertion
- Guidelines for issue ordering, context requirements, and "No issues found" handling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add DetectionPhaseRunner class as the first phase of the two-phase
detection/suggestion architecture. The detection phase identifies
issues in content based on evaluation criteria using an unstructured
LLM prompt.

Changes:
- Create src/evaluators/detection-phase.ts with DetectionPhaseRunner
- Implement run(content, criteria, options): Promise<DetectionResult>
- Build detection prompt with criteria from PromptFile via getPrompt()
- Use runPromptUnstructured for LLM call to get free-form text response
- Integrate retry logic from retry.ts for transient failure handling
- Include basic markdown parser for Issue N sections
- Define types: RawDetectionIssue, DetectionResult, DetectionPhaseOptions
- Export from src/evaluators/index.ts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…tests

- Fix analysis regex bug in DetectionPhaseRunner.parseIssueSection()
- Add comprehensive test suite with 17 tests including Property 2
- Parser correctly extracts all required fields from markdown responses
- Gracefully handles malformed sections by skipping them
- All tests pass

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
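A simplified version of the parsing step might look like this. It is illustrative only: the real parseIssueSection() also extracts contextBefore/contextAfter and handles more formatting variants, and the exact field layout of the LLM's markdown output is taken from this PR's test fixtures.

```typescript
// Illustrative parser for "## Issue N" detection output. Malformed
// sections are skipped rather than failing the whole response.
interface RawDetectionIssue {
  quotedText: string;
  line: number;
  criterionName: string;
  analysis: string;
}

function parseDetectionMarkdown(text: string): RawDetectionIssue[] {
  // The prompt instructs the model to emit exactly this when clean.
  if (/^no issues found\.?$/i.test(text.trim())) return [];

  // Split on "## Issue N" section headers; index 0 is any preamble.
  const sections = text.split(/^## Issue \d+\s*$/m).slice(1);
  const issues: RawDetectionIssue[] = [];
  for (const section of sections) {
    const quoted = /\*\*quotedText:\*\*\s*\n>\s*(.+)/.exec(section);
    const line = /\*\*line:\*\*\s*(\d+)/.exec(section);
    const criterion = /\*\*criterionName:\*\*\s*(.+)/.exec(section);
    const analysis = /\*\*analysis:\*\*\s*\n([\s\S]+?)(?=\n\*\*|$)/.exec(section);
    if (!quoted || !line || !criterion || !analysis) continue; // skip malformed
    issues.push({
      quotedText: quoted[1].trim(),
      line: Number(line[1]),
      criterionName: criterion[1].trim(),
      analysis: analysis[1].trim(),
    });
  }
  return issues;
}
```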
Add suggestion-phase prompt template to src/evaluators/prompts.json:
- Universal template for all rule types (check, judge, style guide)
- Placeholders for {content}, {issues}, and {criteria}
- Output format with ## Issue N sections matching detection phase
- Each suggestion includes actionable fix and explanation
- Instructs exactly one suggestion per issue
- Guidelines for preserving voice/tone and avoiding over-editing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add buildSuggestionLLMSchema() to src/prompts/schema.ts
- Schema defines array of {issueIndex, suggestion, explanation}
- Strict mode enabled for structured output validation
- Add SuggestionLLMResult TypeScript type
- Schema name: vectorlint_suggestion_result

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create SuggestionPhaseRunner class in src/evaluators/suggestion-phase.ts
- Implement run(content, issues, criteria) method using runPromptStructured
- Add buildPrompt method to format content, issues, and criteria
- Add formatIssues method to convert RawDetectionIssue to markdown
- Integrate withRetry logic for transient failure handling
- Export SuggestionPhaseRunner, Suggestion, SuggestionResult, SuggestionPhaseOptions types
- Add comprehensive test suite with 12 tests in tests/suggestion-phase.test.ts
- Property 4 tests verify suggestion-to-issue matching by index
- Update PRD to mark feature as complete
- Update progress.txt with implementation details

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implemented ResultAssembler class that combines detection and suggestion
phase results into final CheckResult/JudgeResult formats.

- assembleCheckResult: Merges detection issues with suggestions for
  check-style evaluation results
- assembleJudgeResult: Groups violations by criterion for judge-style
  evaluation results
- aggregateTokenUsage: Correctly combines token usage from both phases

Property 6: Result schema conformance - all tests verify output matches
CheckResult and JudgeResult schema requirements

Property 7: Token usage aggregation - tests verify correct summation
of input/output tokens from both phases

All 22 tests pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
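The token aggregation step can be sketched as follows. The inputTokens/outputTokens field names follow the usage shape visible in this PR's test mocks; the real aggregateTokenUsage signature is not shown in this page and may differ.

```typescript
// Sketch of token-usage aggregation across detection and suggestion
// phases; undefined entries (e.g. a skipped suggestion phase) count as zero.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function aggregateTokenUsage(
  ...usages: Array<TokenUsage | undefined>
): TokenUsage {
  return usages.reduce<TokenUsage>(
    (total, u) => ({
      inputTokens: total.inputTokens + (u?.inputTokens ?? 0),
      outputTokens: total.outputTokens + (u?.outputTokens ?? 0),
    }),
    { inputTokens: 0, outputTokens: 0 }
  );
}
```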
- Add DetectionPhaseRunner, SuggestionPhaseRunner, and ResultAssembler to BaseEvaluator
- Rewrite runCheckEvaluation to use detection→suggestion→assembly flow
- Rewrite runJudgeEvaluation to use detection→suggestion→assembly flow
- Pass full document to suggestion phase even when chunking (Property 3)
- Add buildCriteriaString helper for prompt criteria formatting
- Update ResultAssembler to handle string strictness and PromptCriterionSpec
- Add Property 1 and Property 3 tests in base-evaluator-two-phase.test.ts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix azure-openai-provider.test.ts mock syntax and error expectations
- Fix anthropic-provider.test.ts mock variable scope and instanceof checks
- Fix openai-provider.test.ts mock variable scope
- Update scoring-types.test.ts for two-phase detection/suggestion flow:
  - Mock both runPromptUnstructured (detection) and runPromptStructured (suggestion)
  - Fix detection response format to match parser expectations
  - Update score expectations for new density-based scoring formula
  - Add beforeEach hook to clear mocks between tests

All 394 tests pass.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added "Two-Phase Detection/Suggestion Architecture" section explaining all three phases
- Documented key components table with file locations and purposes
- Listed all 7 property tests that validate the architecture
- Updated LLM Provider Methods subsection with both structured and unstructured calls
- Updated project structure to include new evaluator files
- Marked final PRD task as complete

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused imports from test files (DetectionResult, ResultAssemblerOptions, RetryResult, DefaultRequestBuilder, JudgeLLMResult, CheckLLMResult)
- Replace unsafe 'any' type assertions with proper type assertions using 'unknown' intermediate type
- Fix variable naming conventions (camelCase to UPPER_CASE for constants)
- Fix async arrow function warnings by removing async keyword where await is not used
- Add eslint-disable comments for unbound-method warnings on vi.mocked() calls
- Add eslint-disable comments for unsafe type assignments that are necessary for test mocking
- Remove unused imports from src/evaluators/base-evaluator.ts (unused schema and scoring functions)
- Fix variable naming in src/evaluators/result-assembler.ts (final_score to finalScore)

All 394 tests still pass. Lint is now clean.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add runtime validation using Zod to ensure LLM responses in the suggestion
phase conform to the expected schema, providing an additional layer of safety
beyond the structured output schema.

Changes:
- Add SUGGESTION_LLM_RESULT_SCHEMA Zod schema in src/prompts/schema.ts
  - Validates suggestions array with issueIndex (positive int), suggestion
    (non-empty string), and explanation (non-empty string)
- Update src/evaluators/suggestion-phase.ts to use Zod validation
  - Import SUGGESTION_LLM_RESULT_SCHEMA
  - Parse llmResult.data before using it
- Add 4 test cases in tests/suggestion-phase.test.ts for validation:
  - Missing required field
  - Wrong type for issueIndex
  - Invalid value (zero issueIndex)
  - Empty string for suggestion

All 398 tests pass (increased from 394).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
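The constraints can be expressed without the Zod dependency as a plain type guard. This is a sketch mirroring the rules the commit describes for SUGGESTION_LLM_RESULT_SCHEMA; the actual implementation uses Zod (z.number().int().positive(), z.string().min(1)).

```typescript
// Plain-TypeScript equivalent of the per-item constraints this PR
// enforces with Zod on suggestion-phase LLM responses.
interface SuggestionItem {
  issueIndex: number;
  suggestion: string;
  explanation: string;
}

function isValidSuggestion(v: unknown): v is SuggestionItem {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return (
    typeof o.issueIndex === "number" &&
    Number.isInteger(o.issueIndex) &&
    o.issueIndex >= 1 && // 1-based, matching "Issue N"
    typeof o.suggestion === "string" &&
    o.suggestion.length > 0 && // non-empty
    typeof o.explanation === "string" &&
    o.explanation.length > 0 // non-empty
  );
}
```

This is exactly the validation gap the review below flags: a structured-output JSON schema that allows floats, zero, and empty strings would accept payloads that these runtime checks reject.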
Update catch block from 'catch {' to 'catch (_e: unknown)' with
explanatory comment as per PRD task requirements.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ng RetryResult wrapper

- Inline buildCheckMessage(), buildCriterionSummary(), buildCriterionReasoning(),
  normalizeStrictness(), and calculateCriterionScore() in result-assembler.ts
- Remove RetryResult<T> wrapper from retry.ts; withRetry() now returns T directly
- Update detection-phase.ts and suggestion-phase.ts to use new withRetry signature
- Remove @example JSDoc blocks from exported classes (ResultAssembler,
  DetectionPhaseRunner, SuggestionPhaseRunner)
- Update tests/retry.test.ts to use direct return value instead of RetryResult

Line count reduced from ~955 to 822 lines (-133 lines, ~14% reduction)

All 398 tests pass, build compiles, lint clean.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Walkthrough

This PR implements a two-phase evaluation architecture that replaces single-pass evaluation. It introduces detection (unstructured LLM calls to identify issues), suggestion (structured LLM calls to propose fixes), result assembly, a retry utility with logging, and extends LLM providers with unstructured prompt support. Comprehensive tests validate the new workflow.

Changes

  • LLM Provider Extension (src/providers/llm-provider.ts, src/providers/openai-provider.ts, src/providers/anthropic-provider.ts, src/providers/azure-openai-provider.ts, src/providers/gemini-provider.ts): Added a runPromptUnstructured method to the LLMProvider interface and implemented it across all four providers; returns raw text with logging, error handling, and token tracking.
  • Detection Phase (src/evaluators/detection-phase.ts): New DetectionPhaseRunner class that uses unstructured LLM calls to identify issues; includes a markdown parser for structured issue extraction (quotedText, contextBefore/After, line, criterionName, analysis).
  • Suggestion Phase (src/evaluators/suggestion-phase.ts): New SuggestionPhaseRunner class that uses structured LLM calls with a JSON schema to generate per-issue suggestions based on detection results and full document context.
  • Result Assembly (src/evaluators/result-assembler.ts): New ResultAssembler class that merges detection and suggestion results into CheckResult/JudgeResult formats; includes token usage aggregation across both phases.
  • Supporting Infrastructure (src/evaluators/retry.ts, src/prompts/schema.ts, src/evaluators/prompts.json, src/evaluators/index.ts): Added withRetry utility for transient LLM failures; new SUGGESTION_LLM_RESULT_SCHEMA for Zod validation; new detection-phase and suggestion-phase prompt templates; updated evaluators index exports.
  • Core Integration (src/evaluators/base-evaluator.ts): Refactored to orchestrate the two-phase flow: detection on chunks, suggestion on the full document, assembly of final results; replaced per-chunk scoring with a unified pipeline.
  • Documentation & Config (AGENTS.md, .gitignore): Documented the three-phase evaluation architecture and component roles; added the .kiro/ directory to .gitignore.
  • Test Coverage (tests/detection-phase.test.ts, tests/suggestion-phase.test.ts, tests/result-assembler.test.ts, tests/retry.test.ts, tests/base-evaluator-two-phase.test.ts, tests/*-provider.test.ts, tests/scoring-types.test.ts): Comprehensive test suites added/updated for all new components; provider tests extended for unstructured flow validation.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Evaluator (BaseEvaluator)
    participant DetectionRunner as DetectionPhaseRunner
    participant LLM as LLM Provider
    participant SuggestionRunner as SuggestionPhaseRunner
    participant Assembler as ResultAssembler

    Client->>DetectionRunner: run(content chunk, criteria)
    loop For each content chunk
        DetectionRunner->>LLM: runPromptUnstructured(detection prompt)
        LLM-->>DetectionRunner: raw text response + usage
    end
    DetectionRunner-->>Client: RawDetectionIssue[], TokenUsage

    alt Issues detected
        Client->>SuggestionRunner: run(full document, issues, criteria)
        SuggestionRunner->>LLM: runPromptStructured(suggestion prompt)
        LLM-->>SuggestionRunner: JSON suggestions + usage
        SuggestionRunner-->>Client: Suggestion[], TokenUsage
        
        Client->>Assembler: assembleCheckResult/assembleJudgeResult(issues, suggestions)
        Assembler->>Assembler: mergeIssuesWithSuggestions()
        Assembler->>Assembler: aggregateTokenUsage()
        Assembler-->>Client: CheckResult/JudgeResult with final_score
    else No issues
        Client->>Assembler: assembleCheckResult/assembleJudgeResult(no issues, [])
        Assembler-->>Client: CheckResult/JudgeResult (perfect score)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • ayo6706

Poem

🐰 Hop, hop, detection found the flaws so clear,
Then suggestions whisper fixes to your ear,
Two phases dance—unstructured, then structured tight,
Result assembled perfect, shining bright!

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Two-Phase Evaluation Architecture' accurately and concisely summarizes the main architectural change: splitting LLM evaluation into detection and suggestion phases.


coderabbitai bot left a comment
Actionable comments posted: 6

🤖 Fix all issues with AI agents
In @src/evaluators/result-assembler.ts:
- Around line 206-239: The comment claiming normalized_score maps 1–4 to 1–10 is
incorrect: normalized_score = score * 2.5 yields 2.5–10; update the code in the
result assembly to either (A) correct the comment to state the actual 2.5–10
range for normalized_score, or (B) change the normalization formula to map 1→1
and 4→10 by replacing score * 2.5 with the linear mapping (score - 1) * 3 + 1
(apply this change where normalized_score is computed and keep the comment
consistent).
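The two formulas named in this comment can be compared directly:

```typescript
// Mapping a 1-4 criterion score onto a 1-10 scale.
const current = (score: number): number => score * 2.5; // 1 -> 2.5, 4 -> 10
const proposed = (score: number): number => (score - 1) * 3 + 1; // 1 -> 1, 4 -> 10
```

The current formula compresses the range to 2.5-10, so either the comment should state that range or the linear mapping should be adopted to hit the documented 1-10 bounds.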

In @src/evaluators/retry.ts:
- Around line 30-31: Initialize and/or properly handle the case when maxRetries
can be 0 so lastError is never left undefined; either validate maxRetries to be
>= 1 up front or set lastError to a meaningful default Error before the retry
loop (or after the loop check) and then throw that descriptive error instead of
undefined. Update the logic around maxRetries, lastError, and the retry loop in
retry.ts (symbols: maxRetries, lastError, the retry loop) so callers passing
maxRetries: 0 get a clear error message rather than an undefined throw.

In @src/providers/gemini-provider.ts:
- Around line 96-138: The model instance created in the constructor was set with
responseMimeType: "application/json" causing runPromptUnstructured to still
request JSON; update runPromptUnstructured to create or obtain a temporary
unstructured model (e.g., clone or instantiate a new Model/Client without
responseMimeType or with responseMimeType: "text/markdown" or "text/plain") and
call generateContent on that unstructured model instead of this.model; ensure
you reuse the same client/config values (model name, temperature) and clean up
or reuse the temporary model reference as appropriate to avoid changing the
constructor-initialized JSON model used elsewhere.

In @tests/azure-openai-provider.test.ts:
- Around line 362-419: The Gemini provider is logging debug output with
console.error while other providers use console.log; locate the Gemini provider
class (e.g., GeminiProvider) and replace any debug console.error calls in its
debug/logging code paths (methods like runPromptUnstructured /
runPromptStructured or any place using console.error for "[vectorlint]" debug
messages) with console.log, preserving the exact log message and payload shape
so tests and consumers remain consistent.
🧹 Nitpick comments (9)
src/providers/openai-provider.ts (1)

181-182: Consider extracting shared logic to reduce duplication.

The runPromptUnstructured method shares ~80% of its code with runPromptStructured (params building, debug logging, error handling, response validation). Additionally, buildPromptBodyForStructured is used in an unstructured context, which is slightly misleading.

Consider extracting common logic into private helper methods:

private buildBaseParams(content: string, promptText: string): OpenAI.Chat.Completions.ChatCompletionCreateParams { ... }
private handleApiError(e: unknown): never { ... }
private logDebugInfo(params: ..., response?: ...): void { ... }

This would keep the provider thin per coding guidelines while reducing maintenance burden.

src/prompts/schema.ts (1)

97-131: JSON schema and Zod schema have validation inconsistencies.

The JSON schema allows:

  • issueIndex as any number (including floats and zero)
  • Empty strings for suggestion and explanation

But the Zod schema requires:

  • issueIndex to be a positive integer
  • Non-empty strings (.min(1))

If the LLM returns {"issueIndex": 0, "suggestion": ""}, it passes the JSON schema but fails Zod validation at runtime.

♻️ Align JSON schema with Zod constraints
 properties: {
   issueIndex: {
-    type: "number",
+    type: "integer",
+    minimum: 1,
     description: "The index of the issue this suggestion addresses (1-based, matching Issue 1, Issue 2, etc.)",
   },
   suggestion: {
     type: "string",
+    minLength: 1,
     description: "Specific, actionable text to replace the problematic content",
   },
   explanation: {
     type: "string",
+    minLength: 1,
     description: "Brief explanation of how this suggestion addresses the issue",
   },
 },

Also applies to: 175-183

AGENTS.md (1)

165-165: Capitalize "Markdown" as a proper noun.

-- LLM returns free-form markdown text with `## Issue N` sections
+- LLM returns free-form Markdown text with `## Issue N` sections
tests/suggestion-phase.test.ts (1)

130-154: Consider testing formatIssues through the public interface instead of type assertion.

While testing private methods can be useful, the type assertion pattern (runner as unknown as { formatIssues: ... }) is fragile. If the method name or signature changes, these tests will fail at runtime rather than compile time. Consider either:

  1. Testing this behavior through run() by verifying the prompt content
  2. Making formatIssues protected if it's part of the class's testable contract
src/evaluators/suggestion-phase.ts (1)

149-152: Potential issue: Template replacement may be fragile if content contains placeholder strings.

The simple replace() calls could have unintended effects if the content, issues text, or criteria string contains literal {content}, {issues}, or {criteria} substrings. While unlikely in practice, this could cause partial replacements or double-replacements.

Consider using a more robust templating approach or replacing placeholders sequentially in a way that prevents re-matching.

♻️ Optional: Use more unique placeholders or sequential replacement
   private buildPrompt(
     content: string,
     issues: RawDetectionIssue[],
     criteria: string
   ): string {
     const template = getPrompt("suggestion-phase");
 
     // Format issues for inclusion in the prompt
     const issuesText = this.formatIssues(issues);
 
-    return template
-      .replace("{content}", content)
-      .replace("{issues}", issuesText)
-      .replace("{criteria}", criteria);
+    // Replace placeholders in sequence to avoid re-matching
+    let result = template.replace("{criteria}", criteria);
+    result = result.replace("{issues}", issuesText);
+    result = result.replace("{content}", content);
+    return result;
   }
tests/base-evaluator-two-phase.test.ts (2)

92-95: Weak assertion: toBeGreaterThanOrEqual(0) is always true for array length.

The assertion on line 94 doesn't verify meaningful behavior—structuredCalls.length will always be >= 0. Consider removing this assertion or replacing it with a more specific expectation based on whether issues were detected.

♻️ Consider removing or strengthening the assertion
       // The mock LLM's runPromptUnstructured should have been called for detection
       const unstructuredCalls = (mockLLM.runPromptUnstructured as ReturnType<typeof vi.fn>).mock.calls;
       expect(unstructuredCalls.length).toBeGreaterThan(0);
 
-      // Verify structured call for suggestion was also made if issues were found
-      const structuredCalls = (mockLLM.runPromptStructured as ReturnType<typeof vi.fn>).mock.calls;
-      expect(structuredCalls.length).toBeGreaterThanOrEqual(0);
+      // When no issues are detected (default mock returns "No issues found."),
+      // suggestion phase should not be called
+      const structuredCalls = (mockLLM.runPromptStructured as ReturnType<typeof vi.fn>).mock.calls;
+      expect(structuredCalls.length).toBe(0);

460-468: Incomplete mock detection response may not parse correctly.

The detection response on line 464 is missing contextBefore and contextAfter fields that other tests include. If the detection parser requires these fields, this test may not accurately simulate real behavior.

♻️ Consider using full detection response format for consistency
       (mockLLM.runPromptUnstructured as ReturnType<typeof vi.fn>).mockImplementation(
         () => {
           detectionCallCount++;
           return {
-            data: "## Issue 1\n\n**quotedText:**\n> problem\n\n**line:** 42\n\n**criterionName:** Criterion 1\n\n**analysis:**\nIssue",
+            data: `## Issue 1
+
+**quotedText:**
+> problem
+
+**contextBefore:**
+before
+
+**contextAfter:**
+after
+
+**line:** 42
+
+**criterionName:** Criterion 1
+
+**analysis:**
+This is an issue`,
             usage: { inputTokens: 100, outputTokens: 50 },
           };
         }
       );
src/evaluators/result-assembler.ts (2)

57-61: Consider the default value for totalWordCount.

Defaulting totalWordCount to 1 avoids division by zero but will produce a heavily penalized score if the caller forgets to pass the actual word count. A single violation would result in a score calculation of 10 - ((1/1)*100*strictness)*2, which is clamped to 1.

Consider logging a warning or using a more representative default (e.g., 100) to make misconfiguration more obvious during testing.


305-311: Clarify the 1-based indexing convention.

The mapping uses index + 1 to look up suggestions, implying Suggestion.issueIndex is 1-based while the issues array is 0-based. Consider adding a brief comment to make this convention explicit, as it's a key coupling between detection and suggestion phases.

     return issues.map((issue, index) => {
+      // Suggestions use 1-based issueIndex; issues array is 0-based
       const matchingSuggestion = suggestionMap.get(index + 1);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03ed298 and fd3fe1d.

📒 Files selected for processing (27)
  • .gitignore
  • AGENTS.md
  • ralph/prd.json
  • ralph/progress.txt
  • src/evaluators/base-evaluator.ts
  • src/evaluators/detection-phase.ts
  • src/evaluators/index.ts
  • src/evaluators/prompts.json
  • src/evaluators/result-assembler.ts
  • src/evaluators/retry.ts
  • src/evaluators/suggestion-phase.ts
  • src/prompts/schema.ts
  • src/providers/anthropic-provider.ts
  • src/providers/azure-openai-provider.ts
  • src/providers/gemini-provider.ts
  • src/providers/llm-provider.ts
  • src/providers/openai-provider.ts
  • tests/anthropic-provider.test.ts
  • tests/azure-openai-provider.test.ts
  • tests/base-evaluator-two-phase.test.ts
  • tests/detection-phase.test.ts
  • tests/gemini-provider.test.ts
  • tests/openai-provider.test.ts
  • tests/result-assembler.test.ts
  • tests/retry.test.ts
  • tests/scoring-types.test.ts
  • tests/suggestion-phase.test.ts
🧰 Additional context used
📓 Path-based instructions (3)
src/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.ts: Use TypeScript ESM with explicit imports and narrow types
Use 2-space indentation; avoid trailing whitespace
Maintain strict TypeScript with no any; use unknown + schema validation for external data
Use custom error types with proper inheritance; catch blocks use unknown type

Files:

  • src/evaluators/retry.ts
  • src/evaluators/index.ts
  • src/providers/gemini-provider.ts
  • src/evaluators/suggestion-phase.ts
  • src/providers/azure-openai-provider.ts
  • src/providers/llm-provider.ts
  • src/prompts/schema.ts
  • src/evaluators/base-evaluator.ts
  • src/evaluators/result-assembler.ts
  • src/providers/openai-provider.ts
  • src/evaluators/detection-phase.ts
  • src/providers/anthropic-provider.ts
src/providers/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

src/providers/**/*.ts: Depend on LLMProvider and SearchProvider interfaces; keep providers thin (transport only)
Inject RequestBuilder via provider constructor to avoid coupling

Files:

  • src/providers/gemini-provider.ts
  • src/providers/azure-openai-provider.ts
  • src/providers/llm-provider.ts
  • src/providers/openai-provider.ts
  • src/providers/anthropic-provider.ts
tests/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.test.ts: Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator
Use dependency injection in tests: mock providers; do not hit network in unit tests

Files:

  • tests/result-assembler.test.ts
  • tests/suggestion-phase.test.ts
  • tests/detection-phase.test.ts
  • tests/scoring-types.test.ts
  • tests/retry.test.ts
  • tests/base-evaluator-two-phase.test.ts
  • tests/openai-provider.test.ts
  • tests/azure-openai-provider.test.ts
  • tests/gemini-provider.test.ts
  • tests/anthropic-provider.test.ts
🧠 Learnings (5)
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to tests/**/*.test.ts : Write tests using Vitest framework with focus on config parsing, file discovery, schema/structured output, and locator

Applied to files:

  • tests/result-assembler.test.ts
  • tests/suggestion-phase.test.ts
  • tests/detection-phase.test.ts
  • tests/scoring-types.test.ts
  • tests/retry.test.ts
  • tests/base-evaluator-two-phase.test.ts
  • tests/azure-openai-provider.test.ts
  • tests/gemini-provider.test.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to tests/**/*.test.ts : Use dependency injection in tests: mock providers; do not hit network in unit tests

Applied to files:

  • tests/detection-phase.test.ts
  • tests/azure-openai-provider.test.ts
  • tests/gemini-provider.test.ts
  • tests/anthropic-provider.test.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/providers/**/*.ts : Depend on `LLMProvider` and `SearchProvider` interfaces; keep providers thin (transport only)

Applied to files:

  • tests/scoring-types.test.ts
  • src/providers/llm-provider.ts
  • src/evaluators/base-evaluator.ts
  • AGENTS.md
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to src/boundaries/**/*.ts : Use Zod schemas for boundary validation of all external data (files, CLI, env, APIs) at system boundaries

Applied to files:

  • src/prompts/schema.ts
📚 Learning: 2025-12-28T19:43:51.189Z
Learnt from: CR
Repo: TRocket-Labs/vectorlint PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-28T19:43:51.189Z
Learning: Applies to evals/**/*.md : Evals must include YAML frontmatter; the tool appends evidence instructions automatically

Applied to files:

  • AGENTS.md
🧬 Code graph analysis (20)
src/evaluators/retry.ts (1)
src/evaluators/index.ts (2)
  • RetryOptions (30-30)
  • withRetry (30-30)
src/evaluators/index.ts (4)
src/evaluators/evaluator-registry.ts (1)
  • EvaluatorRegistry (27-59)
src/output/reporter.ts (2)
  • printEvaluationSummaries (138-175)
  • EvaluationSummary (7-11)
src/cli/orchestrator.ts (1)
  • extractAndReportCriterion (257-463)
src/evaluators/accuracy-evaluator.ts (1)
  • TechnicalAccuracyEvaluator (48-239)
src/providers/gemini-provider.ts (3)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
tests/evaluator.test.ts (1)
  • runPromptStructured (10-14)
src/evaluators/suggestion-phase.ts (4)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/prompts/schema.ts (2)
  • buildSuggestionLLMSchema (97-131)
  • SUGGESTION_LLM_RESULT_SCHEMA (175-183)
src/evaluators/detection-phase.ts (1)
  • RawDetectionIssue (21-34)
src/evaluators/retry.ts (1)
  • withRetry (26-57)
src/providers/azure-openai-provider.ts (3)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
src/boundaries/api-client.ts (1)
  • validateApiResponse (9-20)
tests/detection-phase.test.ts (1)
src/evaluators/detection-phase.ts (2)
  • DetectionPhaseRunner (69-227)
  • RawDetectionIssue (21-34)
tests/scoring-types.test.ts (4)
src/providers/llm-provider.ts (1)
  • LLMProvider (8-11)
src/evaluators/base-evaluator.ts (1)
  • BaseEvaluator (42-297)
tests/evaluator.test.ts (2)
  • runPromptStructured (10-14)
  • FakeProvider (8-15)
tests/provider-factory.test.ts (1)
  • it (227-292)
src/providers/llm-provider.ts (1)
tests/evaluator.test.ts (1)
  • runPromptStructured (10-14)
tests/retry.test.ts (1)
src/evaluators/retry.ts (1)
  • withRetry (26-57)
src/prompts/schema.ts (1)
src/output/rdjson-formatter.ts (1)
  • RdJsonSuggestion (35-47)
src/evaluators/base-evaluator.ts (5)
src/evaluators/evaluator.ts (1)
  • Evaluator (7-9)
src/evaluators/result-assembler.ts (1)
  • ResultAssembler (43-347)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/chunking/index.ts (1)
  • countWords (4-4)
src/cli/orchestrator.ts (3)
  • evaluateFile (757-907)
  • extractAndReportCriterion (257-463)
  • routePromptResult (544-705)
src/providers/openai-provider.ts (2)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
tests/openai-provider.test.ts (2)
src/schemas/openai-responses.ts (1)
  • OpenAIResponse (43-43)
src/providers/openai-provider.ts (1)
  • OpenAIProvider (23-270)
tests/azure-openai-provider.test.ts (1)
src/providers/azure-openai-provider.ts (1)
  • AzureOpenAIProvider (23-223)
AGENTS.md (4)
src/evaluators/accuracy-evaluator.ts (1)
  • TechnicalAccuracyEvaluator (48-239)
tests/evaluator.test.ts (3)
  • FakeProvider (8-15)
  • runPromptStructured (10-14)
  • it (39-53)
src/cli/orchestrator.ts (2)
  • evaluateFile (757-907)
  • runPromptEvaluation (713-752)
src/evaluators/evaluator.ts (1)
  • Evaluator (7-9)
tests/gemini-provider.test.ts (1)
src/providers/gemini-provider.ts (1)
  • GeminiProvider (20-139)
src/evaluators/detection-phase.ts (4)
src/providers/token-usage.ts (1)
  • TokenUsage (1-4)
src/evaluators/retry.ts (1)
  • withRetry (26-57)
tests/evaluator.test.ts (2)
  • runPromptStructured (10-14)
  • FakeProvider (8-15)
src/cli/types.ts (1)
  • RunPromptEvaluationParams (113-119)
tests/anthropic-provider.test.ts (3)
tests/schemas/mock-schemas.ts (1)
  • MockAnthropicClient (66-70)
src/providers/anthropic-provider.ts (1)
  • AnthropicProvider (31-342)
tests/anthropic-e2e.test.ts (3)
  • vi (62-71)
  • default (18-52)
  • MockAnthropicClient (20-24)
ralph/prd.json (1)
tests/evaluator.test.ts (2)
  • runPromptStructured (10-14)
  • FakeProvider (8-15)
src/providers/anthropic-provider.ts (3)
src/providers/llm-provider.ts (1)
  • LLMResult (3-6)
src/errors/index.ts (1)
  • handleUnknownError (46-51)
src/schemas/anthropic-responses.ts (1)
  • isTextBlock (44-46)
🪛 LanguageTool
AGENTS.md

[uncategorized] ~165-~165: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...tUnstructured) - LLM returns free-form markdown text with `## Issue N` sections - Conte...

(MARKDOWN_NNP)

ralph/progress.txt

[uncategorized] ~54-~54: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...ded format instructions for output with markdown code block example - Output template sp...

(MARKDOWN_NNP)


[uncategorized] ~61-~61: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...ontent. The unstructured output format (markdown with Issue N sections) is designed to b...

(MARKDOWN_NNP)


[uncategorized] ~75-~75: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...sient failure handling - Includes basic markdown parser to extract issues from LLM respo...

(MARKDOWN_NNP)


[uncategorized] ~168-~168: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...hod converts RawDetectionIssue array to markdown format for prompt inclusion - Uses `run...

(MARKDOWN_NNP)


[style] ~234-~234: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ...tly typed - Optional fields handled correctly - CheckResult and JudgeResult outpu...

(ADVERB_REPETITION_PREMIUM)


[style] ~240-~240: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ...n suggestion is undefined - Returns only suggestion usage when detection is unde...

(ADVERB_REPETITION_PREMIUM)


[style] ~347-~347: The words ‘explanations’ and ‘explaining’ are quite similar. Consider replacing ‘explaining’ with a different word.
Context: ...haseRunner` has detailed JSDoc comments explaining purpose - `SuggestionPhaseRunner` has...

(VERB_NOUN_SENT_LEVEL_REP)


[style] ~430-~430: This phrase might be redundant. Consider either removing or replacing the adjective ‘additional’.
Context: ...rs Purpose: Zod runtime validation adds an additional layer of safety for the suggestion phase LLM ...

(ADD_AN_ADDITIONAL)

🔇 Additional comments (73)
.gitignore (1)

11-11: Verify intent: .kiro/ directory is no longer ignored.

Commenting out this entry means the .kiro/ directory could now be tracked by Git if present. This change appears unrelated to the two-phase evaluation architecture. Was this intentional, or should this remain ignored to avoid committing tool-specific files?

src/providers/llm-provider.ts (1)

8-11: LGTM!

Clean interface extension adding unstructured prompt support. The signature mirrors runPromptStructured appropriately (minus the schema parameter), and LLMResult<string> is the correct return type for raw text responses. This keeps providers thin as transport-only components per coding guidelines.
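To make the contract concrete, here is a minimal sketch of the extended interface as described in this review. The `TokenUsage` shape is inferred from the usage mapping elsewhere in the diff, and the structured method's schema parameter is typed `unknown` as a placeholder for the real schema type:

```typescript
// Sketch only: field names mirror the PR's usage mapping (inputTokens/outputTokens);
// the schema parameter type is a stand-in, not the repo's actual type.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface LLMResult<T> {
  data: T;
  usage?: TokenUsage;
}

interface LLMProvider {
  runPromptStructured<T>(content: string, promptText: string, schema: unknown): Promise<LLMResult<T>>;
  runPromptUnstructured(content: string, promptText: string): Promise<LLMResult<string>>;
}

// A fake provider in the spirit of the repo's FakeProvider test helper:
// transport-only, no network access.
const fakeProvider: LLMProvider = {
  async runPromptStructured<T>(): Promise<LLMResult<T>> {
    return { data: {} as T };
  },
  async runPromptUnstructured(content: string): Promise<LLMResult<string>> {
    return { data: `echo:${content}`, usage: { inputTokens: 1, outputTokens: 2 } };
  },
};
```

Keeping both methods on one interface lets evaluators take a single injected provider for both phases.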

src/evaluators/prompts.json (1)

3-4: LGTM!

The new detection-phase and suggestion-phase prompts are well-structured and align with the two-phase evaluation architecture. The detection prompt specifies a clear markdown output format for parsing, while the suggestion prompt includes appropriate template placeholders ({content}, {issues}, {criteria}) for runtime substitution.

src/providers/openai-provider.ts (1)

218-268: LGTM on error handling and token usage tracking.

The error handling properly distinguishes between rate limits, authentication, and general API errors. Token usage mapping correctly extracts `prompt_tokens` → `inputTokens` and `completion_tokens` → `outputTokens`, maintaining consistency with the structured path.

src/providers/anthropic-provider.ts (1)

242-341: Implementation is correct and consistent.

The unstructured method properly:

  • Omits tool schema and tool_choice from the request
  • Reuses the validated response schema which handles both structured and unstructured content blocks
  • Correctly filters for text blocks and concatenates them
  • Handles all Anthropic-specific error types consistently with the structured method
src/providers/azure-openai-provider.ts (1)

142-222: LGTM - Implementation correctly omits JSON schema for unstructured responses.

The key difference from runPromptStructured (no response_format parameter) is correct for unstructured text output. Response validation, error handling, and usage tracking are consistent with the structured method.

ralph/prd.json (1)

1-207: PRD tracking file documents implementation tasks correctly.

This is project management metadata that accurately tracks the two-phase architecture implementation. All tasks are marked complete and the JSON structure is valid.

ralph/progress.txt (1)

1-470: Progress log provides comprehensive implementation documentation.

This development log accurately documents the incremental implementation of the two-phase architecture. The static analysis hints about "Markdown" capitalization are minor style nits that don't affect the documentation's utility.

src/prompts/schema.ts (1)

160-183: Well-structured type and runtime validation.

The SuggestionLLMResult type and SUGGESTION_LLM_RESULT_SCHEMA are well-documented and follow the coding guidelines for using Zod schemas for boundary validation of external data.

tests/openai-provider.test.ts (3)

1021-1067: Comprehensive unstructured response handling tests.

The test suite properly validates the new runPromptUnstructured method including raw text extraction and token usage tracking. The tests follow the Vitest framework guidelines and use dependency injection with mocked providers.


1069-1117: Verify intentional behavioral difference for empty/null content.

The unstructured path returns an empty string for empty/null content (lines 1091, 1116), while the structured path throws "Empty response from OpenAI API (no content)." (lines 596-598). This asymmetry may be intentional since unstructured responses don't require content, but worth confirming this design decision.


1336-1359: Good verification that unstructured calls omit response_format.

This test correctly ensures the unstructured path doesn't include response_format, which would force JSON mode. This is essential for the two-phase architecture where detection uses free-form markdown responses.

tests/retry.test.ts (3)

1-11: Solid test foundation for retry utility.

The initial tests correctly validate the happy path (first attempt success) and verify the operation is called exactly once. The mock setup is clean and follows Vitest best practices.


28-49: Good coverage of retry exhaustion and default behavior.

The tests properly validate that:

  • All retries are exhausted before throwing
  • The last error is propagated
  • Default maxRetries of 3 is applied when not specified

This aligns with the implementation in src/evaluators/retry.ts.


112-136: Property test validates retry success within limit.

The test effectively demonstrates that the retry mechanism correctly returns successful results when the operation succeeds before exhausting retries. The manual counter approach clearly shows the attempt progression.

tests/detection-phase.test.ts (4)

7-14: Clean mock provider factory for detection phase tests.

The mock provider correctly implements only runPromptUnstructured since DetectionPhaseRunner exclusively uses unstructured LLM calls. The factory pattern enables easy customization of mock responses per test.


19-52: Thorough single issue parsing test.

The test validates all required fields are correctly extracted from the markdown format. The expected output structure matches the RawDetectionIssue interface from the implementation.


271-334: Good integration tests for run() method.

The tests properly verify:

  • LLM provider receives content and built prompt
  • Detection results include parsed issues, hasIssues flag, raw response, and usage data
  • Retry logic is appropriately deferred to retry.test.ts

This follows the testing guideline of focusing on component integration rather than duplicating unit tests.


430-529: Excellent robustness testing for malformed sections.

The test at lines 430-529 validates that the parser:

  • Skips issues missing required fields (quotedText, line)
  • Preserves valid issues even when surrounded by invalid ones
  • Correctly identifies 3 valid issues out of 5

This ensures the detection phase gracefully handles LLM output variability.

AGENTS.md (3)

14-17: Good documentation of new evaluator components.

The directory structure documentation now accurately reflects the new two-phase architecture files with clear purpose descriptions.


158-198: Excellent two-phase architecture documentation.

The documentation clearly explains:

  • The purpose and flow of each phase
  • Key components with file locations
  • Property tests that validate the architecture

This will help contributors understand the evaluation flow quickly.


209-212: Clear provider method documentation.

The documentation succinctly describes both LLM provider methods and their use cases, helping developers choose the appropriate method for their phase implementation.

tests/gemini-provider.test.ts (5)

1-20: LGTM! Well-structured mock setup.

The SDK mock correctly isolates the provider from network calls while exposing a shared mock function for test control. The hoisted mock pattern with SHARED_GENERATE_CONTENT enables flexible per-test behavior configuration.


22-45: LGTM! Constructor tests cover essential configuration scenarios.

Tests validate both minimal and full configuration options without hitting the network.


47-216: LGTM! Comprehensive unstructured response handling tests.

Excellent coverage of edge cases including:

  • Whitespace trimming behavior
  • Markdown passthrough without modification
  • No JSON parsing for unstructured responses
  • Graceful handling of null/undefined usage metadata
  • Proper error wrapping and propagation

218-258: LGTM! Debug logging tests properly validate conditional behavior.

Good use of console spy with proper cleanup in afterEach. The tests verify that debug information is logged only when enabled and includes expected metadata.


323-416: LGTM! Structured response handling tests provide good coverage.

Tests validate JSON parsing success, parsing failures, and error propagation with appropriate error messages.

src/evaluators/index.ts (1)

29-51: LGTM! Clean barrel file organization for the two-phase architecture.

The exports are well-organized with clear comments explaining each group. Type exports correctly use the type keyword, and the ordering maintains the self-registration import at the end.

tests/suggestion-phase.test.ts (4)

1-47: LGTM! Well-structured test setup with comprehensive sample data.

The mock provider factory and sample issues array provide a solid foundation for testing the suggestion phase. The sample issues cover diverse criterion types (passive voice, vocabulary, wordy phrases).


48-128: LGTM! Run method tests validate core functionality.

Tests properly verify the LLM provider interaction, suggestion mapping, and result structure including the hasSuggestions convenience flag.


156-367: LGTM! Thorough property-based tests for suggestion-to-issue matching.

Excellent coverage of the matching contract including:

  • 1-based indexing semantics
  • Partial suggestion sets (LLM may not suggest for all issues)
  • Order preservation from LLM response
  • Field type validation (positive integer, non-empty strings)

369-465: LGTM! Zod validation tests ensure runtime type safety.

Good coverage of validation edge cases ensuring malformed LLM responses are rejected. This provides a safety net against LLM hallucinations producing invalid data structures.
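As a rough illustration of the contract those tests enforce (field names here are hypothetical; the real Zod schema in src/prompts/schema.ts is the source of truth), a suggestion entry must carry a 1-based positive integer index and non-empty strings:

```typescript
// Hypothetical shape for illustration only; the actual validation is the
// Zod schema in src/prompts/schema.ts.
interface RawSuggestion {
  issueIndex: number;   // 1-based index into the detected issues
  suggestedText: string;
}

function isValidSuggestion(value: unknown): value is RawSuggestion {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.issueIndex === "number" &&
    Number.isInteger(v.issueIndex) &&
    v.issueIndex >= 1 &&
    typeof v.suggestedText === "string" &&
    v.suggestedText.length > 0
  );
}
```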

tests/scoring-types.test.ts (4)

1-21: LGTM! Proper mock setup for two-phase evaluation testing.

The mock provider correctly implements both runPromptStructured and runPromptUnstructured methods. The beforeEach hook ensures clean state between tests by clearing mock call history.


40-141: LGTM! Judge evaluation tests validate two-phase integration.

The test properly mocks the detection phase with markdown-formatted issues and the suggestion phase with structured suggestions. The assertions verify that:

  • Scores are calculated correctly based on violations
  • Both criteria are represented in results
  • The suggestion phase is appropriately skipped when no issues are detected

183-288: LGTM! Check evaluation tests validate scoring mechanics.

The tests properly verify:

  • Score calculation based on violation density (10 - (2 * 2) = 6)
  • Percentage mapping (60%)
  • Perfect score for zero violations
  • Suggestion phase optimization (skipped when no issues)

291-344: LGTM! Technical accuracy evaluator test validates claim extraction flow.

Good use of vi.resetModules() and vi.doMock() to isolate the test from other tests' mock configurations. The test properly validates the perfect score path when no claims are extracted.

tests/azure-openai-provider.test.ts (4)

1-68: LGTM! Well-structured test setup with hoisted error classes.

The vi.hoisted pattern correctly lifts the error class definitions for use in the mock factory, enabling proper instanceof checks in tests. The mock structure matches the openai SDK's expected interface.


69-109: LGTM! Clean test setup with dynamic mock wiring.

The beforeEach hook properly wires the mocked validateApiResponse function, enabling per-test control over validation behavior. Constructor tests cover both minimal and full configuration scenarios.


111-360: LGTM! Comprehensive unstructured response handling tests.

Excellent coverage including:

  • Raw text extraction and trimming
  • Empty and null content error handling
  • Markdown passthrough
  • Temperature configuration propagation
  • Response validation errors
  • Unknown error wrapping

421-478: LGTM! Structured response handling test validates JSON parsing and usage mapping.

The test properly verifies that structured JSON responses are parsed correctly and usage metadata is mapped to the expected format (inputTokens, outputTokens).

tests/result-assembler.test.ts (5)

1-45: LGTM! Well-structured test helpers and clear test documentation.

The test file header clearly documents which properties are being tested (schema conformance and token aggregation), and the helper functions createDetectionIssues() and createSuggestions() provide clean, reusable test fixtures.


60-198: LGTM! Comprehensive schema conformance and edge case coverage.

The tests thoroughly verify the CheckResult structure including all required fields, proper typing, score ranges, and correct handling of missing or partial suggestions.


230-393: LGTM! Thorough testing of JudgeResult assembly and criterion grouping.

The tests properly verify criterion-level structure, violation grouping logic, score calculation based on violation count, and fallback behavior for missing suggestions.


396-460: LGTM! Complete coverage of token usage aggregation scenarios.

All edge cases are covered: both inputs undefined, single input defined, both inputs defined, zero values, and large token counts.


462-555: LGTM! Integration tests validate end-to-end result assembly.

The integration scenarios effectively combine detection issues, suggestions, and token aggregation to verify the complete workflow for both check and judge evaluation types.

tests/anthropic-provider.test.ts (6)

49-69: LGTM! Clean mock constructor pattern with error class attachment.

The mock constructor pattern properly attaches error classes as static properties, enabling instanceof checks in production code while keeping tests isolated from the network. This aligns with the coding guidelines for dependency injection in tests.


859-915: LGTM! Proper test setup for unstructured response handling.

The test suite follows the established pattern with proper mock setup in beforeEach and conditional cleanup in afterEach. The first test correctly verifies basic text extraction and token usage.


917-991: LGTM! Tests verify text block concatenation and markdown preservation.

These tests ensure multiple text blocks are properly combined and that markdown formatting is preserved as-is, which is essential for the detection phase that returns free-form markdown output.


993-1086: LGTM! Edge case handling for whitespace and missing content.

Tests properly verify whitespace trimming and appropriate error throwing when content blocks are empty or contain only non-text blocks.


1088-1141: LGTM! Comprehensive API error handling tests for unstructured calls.

The tests verify that different Anthropic error types (APIError, RateLimitError, AuthenticationError) are properly caught and re-thrown with appropriate error messages specific to unstructured calls.


1196-1228: LGTM! Critical test verifying tools are excluded from unstructured calls.

This test at lines 1196-1228 is essential for the two-phase architecture—it ensures that runPromptUnstructured does not include tools or tool_choice parameters, which would interfere with free-form text generation in the detection phase.

src/evaluators/base-evaluator.ts (4)

42-55: LGTM! Clean initialization of two-phase architecture components.

The constructor properly initializes all three components needed for the two-phase flow: DetectionPhaseRunner, SuggestionPhaseRunner, and ResultAssembler. The design follows good separation of concerns.


126-191: LGTM! Correct two-phase evaluation flow for judge results.

The implementation correctly:

  1. Runs detection per chunk with criteria context
  2. Only invokes suggestion phase when issues are detected
  3. Passes full document (not chunks) to suggestion phase for coherent suggestions
  4. Properly aggregates token usage from both phases

202-276: LGTM! Check evaluation follows same two-phase pattern with proper severity handling.

The severity resolution at line 252 correctly prioritizes defaultSeverity (constructor param) over prompt.meta.severity, allowing runtime override of prompt-specified severity. The strictness is passed through to the assembler for normalization.


278-296: LGTM! Clean criteria string builder with graceful empty handling.

The method produces a readable format for LLM prompts and correctly handles the edge case of empty or undefined criteria.

src/evaluators/suggestion-phase.ts (3)

24-56: LGTM! Well-designed interfaces for suggestion phase.

The Suggestion, SuggestionResult, and SuggestionPhaseOptions interfaces are clearly documented and provide a clean API surface. The hasSuggestions boolean in SuggestionResult is a helpful convenience property.


164-190: LGTM! Well-formatted issue sections for LLM consumption.

The markdown format is clear and structured, with 1-based indexing that correctly corresponds to the issueIndex expected in suggestion responses. The "(none)" fallback for empty context is a nice touch.


97-105: Verify: content parameter passed to both runPromptStructured and embedded in the prompt template.

The content is embedded in the prompt template via buildPrompt() (line 91), which replaces the {content} placeholder in the suggestion-phase template. This same content is also passed separately to runPromptStructured() (line 100) as the first argument. All LLM providers then receive content twice: once in the system prompt (via buildPromptBodyForStructured(promptText), which preserves the embedded content) and again in the user message (via Input:\n\n${content}). This redundancy is consistent across all LLM providers (OpenAI, Gemini, Azure OpenAI, Anthropic). Consider whether this duplication is intentional or should be refactored.

tests/base-evaluator-two-phase.test.ts (4)

20-68: LGTM! Well-designed mock fixtures for testing.

The MOCK_PROMPT_FILE and MOCK_PROMPT_FILE_JUDGE provide reusable test fixtures with realistic structure. The CREATE_MOCK_LLM_PROVIDER factory function creates fresh mocks per test, following the dependency injection pattern recommended in the coding guidelines.


320-374: LGTM! Critical test verifying full document context in suggestion phase.

This test is essential for the two-phase architecture—it verifies that even when content is chunked for detection, the suggestion phase receives the complete document. The assertion checking suggestionPhaseContent!.length > 1000 provides reasonable validation that the full content was passed.


499-552: LGTM! Tests verify criteria string building through public interface.

Good approach testing the private buildCriteriaString method through the public evaluate interface, verifying both populated criteria (with weights) and the graceful fallback for missing criteria.


667-848: LGTM! Comprehensive severity and strictness handling tests.

The tests cover all severity precedence levels:

  1. Default severity when none specified
  2. Prompt-specified severity
  3. Constructor parameter override

The strictness normalization test ensures string values like "strict" are handled correctly.

src/evaluators/result-assembler.ts (4)

1-20: LGTM!

Clean imports using import type for type-only imports, and the module-level documentation clearly explains the two-phase architecture responsibility.


21-33: LGTM!

The options interface is well-documented and uses narrow types appropriately. The union type for strictness provides good flexibility while maintaining type safety.


271-288: LGTM!

Clean functional approach that correctly handles all combinations of undefined inputs and produces a properly typed result.
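A sketch of the aggregation contract those cases describe, assuming the `TokenUsage` shape used elsewhere in the PR (this is not the assembler's actual code):

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Undefined in, undefined out when neither phase reported usage;
// otherwise a missing side counts as zero.
function aggregateUsage(a?: TokenUsage, b?: TokenUsage): TokenUsage | undefined {
  if (!a && !b) return undefined;
  return {
    inputTokens: (a?.inputTokens ?? 0) + (b?.inputTokens ?? 0),
    outputTokens: (a?.outputTokens ?? 0) + (b?.outputTokens ?? 0),
  };
}
```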


333-346: LGTM!

The weighted score calculation is correct, with proper handling for the edge case of no criteria. The rounding logic correctly produces one decimal place precision.

src/evaluators/detection-phase.ts (7)

1-16: LGTM!

Clear module documentation and proper use of type-only imports following TypeScript best practices.


17-34: LGTM!

Well-documented interface with clear field definitions matching the expected LLM response structure.


36-56: LGTM!

Both interfaces are well-documented with appropriate optional fields.


69-113: LGTM!

The run method follows a clean flow: prompt construction → LLM call with retry → response parsing → result assembly. The conditional assignment of usage is a good pattern for optional fields.


121-124: LGTM!

Simple template replacement. If the template contains multiple {criteria} placeholders, consider using replaceAll() instead, but this is fine if only one occurrence is expected.


135-155: LGTM!

The parsing logic correctly handles the "no issues found" case and robustly splits the response into individual issue sections. The use of slice(1) to skip intro text is appropriate.


163-226: LGTM!

The parsing logic correctly extracts fields using well-crafted regex patterns with the /s flag for multiline content. Good graceful degradation with try-catch that returns null for malformed sections, and proper handling of the unknown error type per coding guidelines.
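A small sketch of why the `/s` flag matters for multi-line fields (the field labels here are hypothetical, not the parser's actual format):

```typescript
// With /s (dotAll), `.` also matches newlines, so a field body that spans
// lines is captured whole; without it, the pattern fails to match at all.
const section = `**Quoted text:** "The system was designed"
**Analysis:** Passive voice obscures
the actor responsible.`;

const withFlag = /\*\*Analysis:\*\*\s*(.+)$/s.exec(section)?.[1]?.trim();
const withoutFlag = /\*\*Analysis:\*\*\s*(.+)$/.exec(section)?.[1]?.trim();
```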

Comment on lines +206 to +239
const score: 1 | 2 | 3 | 4 =
violations.length === 0
? 4
: violations.length === 1
? 3
: violations.length <= 3
? 2
: 1;

// Build summary
const summary =
violations.length === 0
? `Pass: ${criterionName}`
: `Issue${violations.length > 1 ? "s" : ""} found with ${criterionName}`;

// Build reasoning
let reasoning: string;
if (violations.length === 0) {
reasoning = `Content meets the ${criterionName} criterion.`;
} else {
const parts: string[] = [
`${violations.length} violation${violations.length > 1 ? "s" : ""} of ${criterionName} found.`,
];
for (const v of violations) {
parts.push(`- "${v.quotedText}": ${v.analysis}`);
}
reasoning = parts.join("\n");
}

return {
name: criterionName,
weight,
score,
normalized_score: score * 2.5, // 1-4 to 1-10 scale

⚠️ Potential issue | 🟡 Minor

Misleading comment: normalized_score range is 2.5–10, not 1–10.

The score ranges from 1 to 4, so score * 2.5 produces values in the range 2.5–10, not 1–10 as stated in the comment. This may confuse maintainers or cause issues if downstream code expects a true 1–10 scale.

If a 1–10 scale is desired, consider adjusting the formula (e.g., (score - 1) * 3 + 1 would map 1→1, 4→10).

Suggested comment fix
-          normalized_score: score * 2.5, // 1-4 to 1-10 scale
+          normalized_score: score * 2.5, // 1-4 to 2.5-10 scale
🤖 Prompt for AI Agents
In @src/evaluators/result-assembler.ts around lines 206 - 239, The comment
claiming normalized_score maps 1–4 to 1–10 is incorrect: normalized_score =
score * 2.5 yields 2.5–10; update the code in the result assembly to either (A)
correct the comment to state the actual 2.5–10 range for normalized_score, or
(B) change the normalization formula to map 1→1 and 4→10 by replacing score *
2.5 with the linear mapping (score - 1) * 3 + 1 (apply this change where
normalized_score is computed and keep the comment consistent).
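For comparison, a quick sketch of the two mappings (option A keeps the current formula, option B is the linear remap suggested above):

```typescript
// Option A, current formula: 1→2.5, 2→5, 3→7.5, 4→10 (range 2.5-10).
const scaleCurrent = (score: number): number => score * 2.5;

// Option B, linear remap: 1→1, 2→4, 3→7, 4→10 (true 1-10 range).
const scaleLinear = (score: number): number => (score - 1) * 3 + 1;
```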

Comment on lines +1 to +6
/**
* Retry utility with logging for LLM operations.
*
* Provides exponential backoff retry logic with detailed logging for debugging
* transient failures in LLM API calls.
*/

⚠️ Potential issue | 🟠 Major

Documentation claims exponential backoff, but implementation has none.

The JSDoc states "exponential backoff retry logic" but the implementation immediately retries without any delay between attempts. This could overwhelm APIs during transient failures and defeat the purpose of retrying.

🔧 Proposed fix to add exponential backoff
+const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));
+
 export async function withRetry<T>(
   operation: () => Promise<T>,
   options: RetryOptions
 ): Promise<T> {
-  const { maxRetries = 3, context } = options;
+  const { maxRetries = 3, context, baseDelayMs = 1000 } = options;
   let lastError: unknown;
 
   for (let attempt = 1; attempt <= maxRetries; attempt++) {
     try {
       const data = await operation();
       if (attempt > 1) {
         console.log(
           `[vectorlint] ${context}: Success on attempt ${attempt}/${maxRetries}`
         );
       }
       return data;
     } catch (error) {
       lastError = error;
       if (attempt < maxRetries) {
+        const delay = baseDelayMs * Math.pow(2, attempt - 1);
         console.log(
-          `[vectorlint] ${context}: Attempt ${attempt}/${maxRetries} failed, retrying...`
+          `[vectorlint] ${context}: Attempt ${attempt}/${maxRetries} failed, retrying in ${delay}ms...`
         );
+        await sleep(delay);
       } else {
         console.log(
           `[vectorlint] ${context}: All ${maxRetries} attempts exhausted`
         );
       }
     }
   }
 
   throw lastError;
 }

Update RetryOptions interface:

 export interface RetryOptions {
   /** Maximum number of retry attempts (default: 3) */
   maxRetries?: number;
   /** Context string for logging (e.g., "detection phase", "suggestion phase") */
   context: string;
+  /** Base delay in milliseconds for exponential backoff (default: 1000) */
+  baseDelayMs?: number;
 }
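With the proposed defaults (`baseDelayMs = 1000`), the backoff schedule would look like this:

```typescript
// Delay before retry attempt N (1-based): baseDelayMs * 2^(N-1).
const baseDelayMs = 1000;
const backoffDelays = (attempts: number): number[] =>
  Array.from({ length: attempts }, (_, i) => baseDelayMs * Math.pow(2, i));
```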

Comment on lines +30 to +31
const { maxRetries = 3, context } = options;
let lastError: unknown;

⚠️ Potential issue | 🟡 Minor

Edge case: maxRetries = 0 throws undefined.

If a caller passes maxRetries: 0, the loop never executes and lastError remains undefined, which would throw a confusing undefined error.

🔧 Proposed fix
 export async function withRetry<T>(
   operation: () => Promise<T>,
   options: RetryOptions
 ): Promise<T> {
   const { maxRetries = 3, context } = options;
+  
+  if (maxRetries < 1) {
+    throw new Error(`${context}: maxRetries must be at least 1`);
+  }
+  
   let lastError: unknown;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-const { maxRetries = 3, context } = options;
-let lastError: unknown;
+const { maxRetries = 3, context } = options;
+
+if (maxRetries < 1) {
+  throw new Error(`${context}: maxRetries must be at least 1`);
+}
+
+let lastError: unknown;
🤖 Prompt for AI Agents
In @src/evaluators/retry.ts around lines 30-31: handle the case where maxRetries is 0 so lastError is never left undefined. Either validate up front that maxRetries is >= 1, or set lastError to a meaningful default Error before the retry loop, then throw that descriptive error instead of undefined. Update the logic around maxRetries, lastError, and the retry loop in retry.ts so callers passing maxRetries: 0 get a clear error message rather than an undefined throw.

Comment on lines +96 to +138
async runPromptUnstructured(content: string, promptText: string): Promise<LLMResult<string>> {
    const systemPrompt = this.builder.buildPromptBodyForStructured(promptText);

    const fullPrompt = `${systemPrompt}

Input:
${content}
`;

    if (this.config.debug) {
      console.error('[vectorlint] Sending unstructured request to Gemini:', {
        model: this.config.model,
        temperature: this.config.temperature,
      });
      if (this.config.showPrompt) {
        console.error('[vectorlint] Full prompt:');
        console.error(fullPrompt);
      } else if (this.config.showPromptTrunc) {
        console.error('[vectorlint] Prompt preview (first 500 chars):');
        console.error(fullPrompt.slice(0, 500));
        if (fullPrompt.length > 500) console.error('... [truncated]');
      }
    }

    try {
      const result = await this.model.generateContent(fullPrompt);
      const response = result.response;
      const text = response.text();
      const usageMetadata = response.usageMetadata;

      const llmResult: LLMResult<string> = { data: text.trim() };
      if (usageMetadata) {
        llmResult.usage = {
          inputTokens: usageMetadata.promptTokenCount ?? 0,
          outputTokens: usageMetadata.candidatesTokenCount ?? 0,
        };
      }
      return llmResult;
    } catch (e: unknown) {
      const err = handleUnknownError(e, 'Gemini API call');
      throw new Error(`Gemini API call failed: ${err.message}`);
    }
  }

⚠️ Potential issue | 🟠 Major

Model configured for JSON output is reused for unstructured calls.

The this.model instance is initialized in the constructor with responseMimeType: "application/json" (line 37), which instructs Gemini to always return JSON-formatted output. When runPromptUnstructured calls this.model.generateContent(fullPrompt), the API will still attempt to format the response as JSON, which conflicts with the detection phase's expectation of free-form markdown text.

🔧 Suggested fix: Create an unstructured model on-demand
 async runPromptUnstructured(content: string, promptText: string): Promise<LLMResult<string>> {
     const systemPrompt = this.builder.buildPromptBodyForStructured(promptText);

     const fullPrompt = `${systemPrompt}

         Input:
         ${content}
     `;

     if (this.config.debug) {
         console.error('[vectorlint] Sending unstructured request to Gemini:', {
             model: this.config.model,
             temperature: this.config.temperature,
         });
         if (this.config.showPrompt) {
             console.error('[vectorlint] Full prompt:');
             console.error(fullPrompt);
         } else if (this.config.showPromptTrunc) {
             console.error('[vectorlint] Prompt preview (first 500 chars):');
             console.error(fullPrompt.slice(0, 500));
             if (fullPrompt.length > 500) console.error('... [truncated]');
         }
     }

     try {
+        // Create a model without JSON response type for unstructured text output
+        const unstructuredModel = this.client.getGenerativeModel({
+            model: this.config.model!,
+            generationConfig: {
+                ...(this.config.temperature !== undefined && { temperature: this.config.temperature }),
+            }
+        });
-        const result = await this.model.generateContent(fullPrompt);
+        const result = await unstructuredModel.generateContent(fullPrompt);
         const response = result.response;
         const text = response.text();
🤖 Prompt for AI Agents
In @src/providers/gemini-provider.ts around lines 96-138: the model instance created in the constructor is configured with responseMimeType: "application/json", so runPromptUnstructured still requests JSON. Update runPromptUnstructured to create or obtain a temporary unstructured model (e.g., instantiate a new model without responseMimeType, or with "text/plain") and call generateContent on that model instead of this.model. Reuse the same client/config values (model name, temperature), and leave the constructor-initialized JSON model used elsewhere unchanged.
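The pattern behind this fix — keeping the constructor's JSON-configured model for structured calls and deriving a plain-text model on demand — can be sketched against a stand-in client. FakeClient and Provider here are illustrative names, not the PR's classes:

```typescript
interface GenerationConfig {
  temperature?: number;
  responseMimeType?: string;
}

interface Model {
  config: GenerationConfig;
}

// Stand-in for the Gemini SDK client: getGenerativeModel() binds a model
// to the generationConfig it was created with.
class FakeClient {
  getGenerativeModel(opts: { model: string; generationConfig: GenerationConfig }): Model {
    return { config: opts.generationConfig };
  }
}

class Provider {
  readonly structured: Model;

  constructor(
    private client: FakeClient,
    private modelName: string,
    private temperature?: number
  ) {
    // The constructor-initialized model stays JSON-only, as in the PR.
    this.structured = client.getGenerativeModel({
      model: modelName,
      generationConfig: {
        responseMimeType: 'application/json',
        ...(temperature !== undefined && { temperature }),
      },
    });
  }

  // Unstructured calls get a model with no responseMimeType, so the API
  // is free to return free-form markdown.
  unstructuredModel(): Model {
    return this.client.getGenerativeModel({
      model: this.modelName,
      generationConfig: {
        ...(this.temperature !== undefined && { temperature: this.temperature }),
      },
    });
  }
}
```

Reusing the same model name and temperature for both configurations keeps the two phases comparable while only the response format differs.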

      throw new Error('Empty response from OpenAI API (no choices).');
    }

    const responseText = firstChoice.message.content?.trim() ?? '';

⚠️ Potential issue | 🟡 Minor

Inconsistent empty content handling vs. structured path.

The structured path (line 155-157) throws an error when responseText is empty, but the unstructured path falls back to an empty string with ?? ''. This inconsistency could mask issues where the LLM returns no content.

🔧 Suggested fix for consistency
-    const responseText = firstChoice.message.content?.trim() ?? '';
+    const responseText = firstChoice.message.content?.trim();
+    if (!responseText) {
+      throw new Error('Empty response from OpenAI API (no content).');
+    }

Alternatively, if empty responses are valid for unstructured calls, document this intentional difference.

📝 Committable suggestion

Suggested change
-    const responseText = firstChoice.message.content?.trim() ?? '';
+    const responseText = firstChoice.message.content?.trim();
+    if (!responseText) {
+      throw new Error('Empty response from OpenAI API (no content).');
+    }
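If the strict behavior is wanted on both paths, a small shared guard keeps them in lockstep. This helper is illustrative and not present in the PR:

```typescript
// Trims the model output and rejects empty responses uniformly, so the
// structured and unstructured paths cannot drift apart.
function requireNonEmpty(text: string | null | undefined, source: string): string {
  const trimmed = text?.trim();
  if (!trimmed) {
    throw new Error(`Empty response from ${source} (no content).`);
  }
  return trimmed;
}
```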

Comment on lines +362 to +419
describe('Debugging and Logging', () => {
  let consoleSpy: ReturnType<typeof vi.spyOn>;

  beforeEach(() => {
    consoleSpy = vi.spyOn(console, 'log').mockImplementation(() => { });
  });

  afterEach(() => {
    consoleSpy.mockRestore();
  });

  it('logs debug information for unstructured calls', async () => {
    const config = {
      apiKey: 'sk-test-key',
      endpoint: 'https://test.openai.azure.com',
      deploymentName: 'gpt-4o',
      debug: true,
    };

    const mockResponse: OpenAIResponse = {
      choices: [
        {
          message: {
            content: 'Response text',
          },
          finish_reason: 'stop',
        },
      ],
      usage: {
        prompt_tokens: 50,
        completion_tokens: 20,
        total_tokens: 70,
      },
    };

    SHARED_CREATE.mockResolvedValue(mockResponse);

    const provider = new AzureOpenAIProvider(config);
    await provider.runPromptUnstructured('Test content', 'Test prompt');

    expect(consoleSpy).toHaveBeenCalledWith(
      '[vectorlint] Sending unstructured request to Azure OpenAI:',
      expect.objectContaining({
        model: 'gpt-4o',
        temperature: undefined,
      })
    );

    expect(consoleSpy).toHaveBeenCalledWith(
      '[vectorlint] LLM response meta:',
      expect.objectContaining({
        // eslint-disable-next-line @typescript-eslint/no-unsafe-assignment
        usage: expect.anything(),
        finish_reason: 'stop',
      })
    );
  });
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Checking console logging in provider implementations ==="
rg -n "console\.(log|error|warn)" src/providers/ --type ts -C 2

Repository: TRocket-Labs/vectorlint

Length of output: 16050


Standardize debug logging across all providers.

The Azure test correctly spies on console.log, which matches the Azure OpenAI implementation. However, the Gemini provider is the only one using console.error for debug logging, while Azure, OpenAI, Anthropic, and Perplexity all use console.log. Align the Gemini provider to use console.log for consistency across all providers.

🤖 Prompt for AI Agents
In @tests/azure-openai-provider.test.ts around lines 362-419: the Gemini provider logs debug output with console.error while the other providers use console.log. Locate the Gemini provider class (e.g., GeminiProvider) and replace the debug console.error calls in its logging code paths (methods like runPromptUnstructured / runPromptStructured, or anywhere console.error emits "[vectorlint]" debug messages) with console.log, preserving the exact log message and payload shape so tests and consumers remain consistent.
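One way to keep the channel from drifting again is to route every provider's debug output through a single helper. This is a sketch; debugLog is a hypothetical name, not part of the PR:

```typescript
// Single place that decides the output channel; tests that spy on
// console.log then work identically for every provider.
function debugLog(message: string, payload?: unknown): void {
  if (payload !== undefined) {
    console.log(message, payload);
  } else {
    console.log(message);
  }
}
```

Each provider would then call debugLog('[vectorlint] ...', meta) instead of picking console.log or console.error itself.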
